(54) Compiler for increased data cache efficiency

(57) A compiler that facilitates efficient insertion of 
explicit data prefetch instructions into loop structures 
within applications uses simple address expression 
analysis to determine data prefetching requirements. 
Analysis and explicit data cache prefetch instruction 
insertion are performed by the compiler in a machine- 
instruction level optimizer to provide access to more 
accurate expected loop iteration latency information. 
Such prefetch instruction insertion strategy tolerates 
worst-case alignment of user data structures relative to 
data cache lines. Execution profiles from previous runs 
of an application are exploited in the insertion of 
prefetch instructions into loops with internal control flow. 
Cache line reuse patterns across loop iterations are 
recognized to eliminate unnecessary prefetch instruc- 
tions. The prefetch insertion algorithm is integrated with 
other low-level optimization phases, such as loop unroll- 
ing, register reassociation, and instruction scheduling. 
An alternative embodiment of the compiler limits the 
insertion of explicit prefetch instructions to those situa- 
tions where the lower bound on the achievable loop iter- 
ation latency is unlikely to be increased as a result of the 
insertion. 
Description

The invention relates to techniques for reducing data cache overhead in a computer system. More particularly, the
invention relates to compiler-related techniques that are useful for reducing data cache overhead. 

Data cache misses (described in greater detail below) can account for a significant portion of an application
program's execution time on modern processors. This is particularly true in the case of scientific applications that
manipulate large data structures which run on high frequency processors having long memory latencies. With increasing
mismatch between processor and memory, the high penalty of cache misses has become and continues to be a dom- 
inant performance limiter of microprocessors. Increasing the cache size is one way to reduce cache misses. However, 
because the size of many numerical applications is also growing rapidly from generation to generation, the first level 
cache may not always be large enough to capture critical working sets. 

Most modern computer systems employ such caches to bridge the gap between memory and processor speeds.
However, despite high cache hit ratios, the cost of cache misses in high frequency processors can significantly degrade
runtime performance. To illustrate this point, a plausible scenario has been suggested where a cache miss penalty is
100 processor cycles and a data reference occurs every four cycles. See, for example, Alexander C. Klaiber, Henry M.
Levy, An Architecture for Software-Controlled Data Prefetching, Proceedings of the 18th Annual International
Symposium on Computer Architecture, May 1991. Even assuming a cache hit ratio of 99%, the processor is stalled for
memory 20% of the time.
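The 20% figure can be checked with a short calculation. The sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes that every miss stalls the processor for the full penalty with no overlap.

```c
#include <assert.h>

/* Fraction of total time the processor is stalled on data cache misses.
 * Hypothetical helper for illustration; assumes each miss stalls the
 * processor for the full miss penalty (no overlap with computation).
 *   hit_ratio       fraction of data references that hit in the cache
 *   miss_penalty    stall cycles per miss
 *   cycles_per_ref  busy cycles between consecutive data references */
double stall_fraction(double hit_ratio, double miss_penalty,
                      double cycles_per_ref)
{
    assert(hit_ratio >= 0.0 && hit_ratio <= 1.0);
    assert(cycles_per_ref > 0.0);

    /* Average stall cycles charged to each data reference. */
    double stall_per_ref = (1.0 - hit_ratio) * miss_penalty;

    /* Stall time divided by total (busy + stall) time per reference. */
    return stall_per_ref / (cycles_per_ref + stall_per_ref);
}
```

With a 99% hit ratio, a 100-cycle penalty, and one reference every four cycles, each reference carries on average one stall cycle on top of four busy cycles, which is one fifth of the total time.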

One way to ameliorate the high overhead of data cache misses is to overlap the fetching of data from memory to
the data cache with other useful computations. Certain high-performance superscalar microprocessors are able to
achieve some degree of overlap between data cache miss handling and processor computation automatically through
out-of-order instruction execution, facilitated by instruction queues capable of holding renamed register results, in
conjunction with a split-transaction memory bus (e.g. such microprocessors as the Silicon Graphics T5, Hewlett-Packard
PA8000, and Sun UltraSPARC). However, the degree of overlap typically achieved is insufficient to fully cover an
external data cache miss latency.

Some of these microprocessors support explicit data prefetch instructions that may be used to reduce the high 
overhead of data cache misses more effectively. Such instructions are typically defined to initiate data cache miss han- 
dling without holding up instruction execution until the referenced data is retrieved from memory. 

By inserting explicit data cache prefetch instructions into the code stream, a compiler can help ameliorate the high 
cost of data cache misses. However, this approach must be implemented judiciously because explicit cache prefetch 
instructions, in general, increase the dynamic path length of an application, and the added overhead may not be offset 
by a corresponding decrease in data cache miss overhead. 

There is much published literature on cache design trade-offs and hardware approaches to improving cache
performance. Comparatively, however, there is much less literature on improving cache performance through
software-controlled active cache management. The few papers that discuss software-controlled data prefetching to improve
cache performance include Todd C. Mowry, Monica S. Lam, Anoop Gupta, Design and Evaluation of a Compiler
Algorithm for Prefetching, Proceedings of the 5th International Conference on Architectural Support for Programming
Languages and Operating Systems, October 1992; Alexander C. Klaiber, Henry M. Levy, An Architecture for
Software-Controlled Data Prefetching, Proceedings of the 18th Annual International Symposium on Computer Architecture,
May 1991 (an approach based on hand analysis); and David Callahan, Ken Kennedy, Allan Porterfield, Software
Prefetching, Proceedings of the 4th International Conference on Architectural Support for Programming Languages and
Operating Systems, April 1991 (where a prefetch instruction is added for each loop body memory reference without
considering or exploiting cache line re-use, such that there is no selectivity; and where the prefetch insertion is
performed at the source-code level, such that there is little integration with other compiler optimization phases;
additionally, because the analysis is done at the source code level, it is difficult to estimate the prefetch iteration
distance (PFID), i.e. the PFID used is always one loop iteration, which may be insufficient to hide the full cache miss
latency).

These papers concentrate on explicit prefetches for subscripted variables that are referenced in loops. They do not 
discuss insertion of explicit prefetch instructions into straight-line code for scalar or indirect memory references. Fur- 
thermore, it is generally assumed that the arrays of interest are all aligned on cache line boundaries. 

There are some general observations that are more or less common to the different studies of software-controlled 
data prefetching. One such observation is that data prefetching does not come for free. Specifically, explicit prefetches 
use up instruction issue bandwidth. In addition to the prefetch instruction itself, typically one or more instructions are 
needed to compute the address of the memory location to be prefetched. Recycling the computed prefetch address for 
the actual reference can involve tying up registers for extended lifetimes. The increased register pressure can result in 
the introduction of spill code in expensive loops. This can offset the expected performance gains due to prefetching. 

A simple prefetch strategy, such as the one proposed by David Callahan, Ken Kennedy, Allan Porterfield, Software
Prefetching, Proceedings of the 4th International Conference on Architectural Support for Programming Languages
and Operating Systems, April 1991, can wastefully increase the number of executed instructions through multiple
prefetch requests for lines already in the data cache.

Another important consideration cited by the different papers on software-controlled data prefetching is the actual
placement of the prefetch instructions. If a prefetch is issued too close time-wise to the memory reference that needs 
to access the prefetched data, the prefetched data may not be available in time to avoid a CPU stall. On the other hand, 
if the prefetch is issued too early, there is a possibility of the prefetched line being displaced from the cache prematurely. 

Todd C. Mowry, Monica S. Lam, Anoop Gupta, Design and Evaluation of a Compiler Algorithm for Prefetching,
Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating
Systems, October 1992, discuss the notion of identifying a prefetch predicate and the leading reference amongst multiple
references to an array to facilitate selective prefetching. This paper also discusses the interaction of data prefetching 
with other compiler transformations, specifically cache blocking and software pipelining. The prefetching algorithm dis- 
closed is effective at reducing explicit data prefetch overhead. One shortcoming with this approach is that it relies on 
reuse and locality analysis that is rather complex. The analysis is done in the context of a high-level optimizer, which 
makes it difficult to estimate the prefetch iteration distance because the effects of downstream compiler components 
(e.g. code generator and low-level optimizer) on the loop body are unknown. It is also unclear how cache line alignment 
of prefetched data structures is accounted for when memory strides are greater than the cache line size. Also, it is 
unclear whether unnecessary prefetches are inserted for certain types of data reuse patterns. For instance, for the
following C code fragment, the disclosed algorithm may actually insert three prefetches when two would be sufficient to
ensure full miss coverage.

    int A[100][100];

    for (i = 0; i < 100; i++)
        for (j = 0; j < 100; j++)
        {
            ...A[j-1][i]...
            ...A[j][i+1]...
            ...A[j+1][i-1]...
        }



Assume that a target processor supports a data cache line prefetch instruction with the following characteristics: 

• It allows a memory address to be specified much like in an ordinary load or store instruction; 

• If the memory referenced by the prefetch instruction is not found in the data cache, the processor causes the ref- 
erenced memory location to be retrieved from lower levels of the memory hierarchy without stalling the execution 
of other instructions in the processor's execution pipelines; and 

• The processor does not signal an exception even when the memory address specified by a prefetch instruction is 
invalid. 

The current invention provides a new compiler for such a processor that facilitates efficient insertion of explicit data 
prefetch instructions into loops within application programs. The compiler uses simple subscript expression analysis to 
determine data prefetching requirements. Analysis and explicit data cache prefetch instruction insertion are performed 
by the compiler in a machine instruction level optimizer to provide access to more accurate expected loop iteration
latency information.

Such a prefetch instruction insertion strategy tolerates worst case alignment of user data structures relative to data 
cache lines. Execution profiles from previous runs of an application are exploited in the insertion of prefetch instructions 
into loops with internal control flow. Cache line reuse patterns across loop iterations are recognized to eliminate unnec- 
essary prefetch instructions. The prefetch insertion algorithm is integrated with other low level optimization phases, 
such as loop unrolling, register reassociation, and instruction scheduling.

An alternative embodiment of the compiler limits the insertion of explicit prefetch instructions to those situations
where the lower bound on the achievable loop iteration latency is unlikely to be increased as a result of the insertion. 

The invention will now be explained with reference to exemplary embodiments which are illustrated in the accom- 
panying drawings, in which: 

Fig. 1 is a block schematic diagram of a uniprocessor computer architecture including a processor cache;

Fig. 2 is a block schematic diagram of a modern software compiler;

Fig. 3 is a schematic representation of a loop;

Fig. 4 is a schematic representation of a direct mapped data cache;

Fig. 5 is a schematic representation of a loop, including a prefetch instruction;

Fig. 6 is a schematic representation of a loop that has been unrolled four times;

Fig. 7 is a schematic representation of an unrolled loop, including a prefetch instruction;

Fig. 8 is a block diagram showing a low level optimizer for a compiler, including a prefetch driver according to the
invention;

Fig. 9 is a block diagram of a prefetch driver according to the invention;

Fig. 10 is a block diagram of a loop body analysis module according to the invention;

Fig. 11 is a block diagram of a module that is used to compute prefetch instruction needed for equivalence class
according to the invention; and

Fig. 12 is a block diagram of a module that applies a large stride cluster identifier to an equivalence class according
to the invention.

The invention provides a new compiler that facilitates efficient insertion of explicit data prefetch instructions into
loops within applications. Fig. 1 is a block schematic diagram of a uniprocessor computer architecture 10 including a
processor cache. In the figure, a processor 11 includes a cache 12 which is in communication with a system bus 15. A
system memory 13 and one or more I/O devices 14 are also in communication with the system bus.

Fig. 2 is a block schematic diagram of a software compiler 20, for example as may be used in connection with the
computer architecture 10 shown in Fig. 1. The compiler Front End component 21 reads a source code file (100) and
translates it into a high level intermediate representation (110). A high level optimizer 22 optimizes the high level
intermediate representation (110) into a more efficient form. A code generator 23 translates the optimized high level
intermediate representation to a low level intermediate representation (120). The low level optimizer 24 converts the
low level intermediate representation (120) into a more efficient (machine-executable) form. Finally, an object file
generator 25 writes out the optimized low-level intermediate representation into an object file (141). The object file
(141) is processed along with other object files (140) by a linker 26 to produce an executable file (150), which can be
run on the computer 10. In the invention described herein, it is assumed that the executable file (150) can be
instrumented by the compiler (20) and linker (26) so that when it is run on the computer 10, an execution profile (160)
may be generated, which can then be used by the low level optimizer 24 to better optimize the low-level intermediate
representation (120). The compiler 20 is discussed in greater detail below.

In contrast to previous approaches to the cache miss problem discussed above (see Todd C. Mowry, Tolerating
Latency Through Software-Controlled Data Prefetching, PhD Thesis, Dept. of Electrical Engineering, Stanford
University, March 1994; D. Callahan, K. Kennedy, A. Porterfield, Software Prefetching, Proceedings of the Fourth
International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 40-52, April
1991; and W.Y. Chen, S.A. Mahlke, P.P. Chang, W.W. Hwu, Data access microarchitectures for superscalar processors with
compiler-assisted data prefetching, Proceedings of Microcomputing 24, 1991) the new compiler has the following unique
attributes:

• Simple subscript expression analysis is used to determine data prefetching requirements, as opposed to
sophisticated reuse/dependence analysis.

• Subscript expression analysis and explicit data cache prefetch instruction insertion are performed by the compiler
in a low-level, i.e. machine instruction level, optimizer. A principal advantage of this approach is access to more
accurate expected loop iteration latency information.

• The prefetch instruction insertion strategy tolerates worst case alignment of user data structures relative to data
cache lines.

• Execution profiles from previous runs of an application are exploited in the insertion of prefetch instructions into
loops with internal control flow.

• Cache line reuse patterns across loop iterations are recognized to eliminate unnecessary prefetch instructions.

• The prefetch insertion algorithm is integrated with other low level optimization phases, such as loop unrolling,
register reassociation, and instruction scheduling.

An alternative embodiment of the new compiler also limits the insertion of explicit prefetch instructions to those
situations where the lower bound on the achievable loop iteration latency is unlikely to be increased as a result of the
insertion.

The new compiler yields significant performance improvements for some industry-standard performance
benchmarks on simulations of the Hewlett-Packard Company (Palo Alto, California) PA-8000 processor.

The following discussion explains compiler operation in the context of a loop within an application program. Loops
are readily recognized as a sequence of code that is iteratively executed some number of times. The sequence of such
operations is predictable because the same set of operations is repeated for each iteration of the loop. It is common
practice in an application program to maintain an index variable for each loop that is provided with an initial value, and
that is incremented by a constant amount for each loop iteration until the index variable reaches a final value. The index
variable is often used to address elements of arrays that correspond to a regular sequence of memory locations. Such
array references by a loop constitute a significant portion of cache misses in scientific applications.

In the compiler, it has been found that the low level optimizer component of a compiler is in a good position to
deduce the number of cycles required by a stretch of code that is repetitively executed. As discussed above, the
concept of prefetching is not new.

Nonetheless it is helpful to explain prefetching at this point. For example, assume that the time that it takes to get
a data item back from main memory to cache is 100 cycles, during which time the processor must wait idly before it
can operate on the data. To avoid wasting idle processor cycles on account of data cache misses, it is desirable to
initiate retrieval of data that is not likely to be found in the cache, in advance of such data being needed by the
processor. The compiler can predict which data is needed in advance for loops that access array elements in a regular
fashion. The compiler can then insert prefetch instructions into loops such that array elements that are likely to be
needed in future loop iterations are retrieved from memory ahead of time. Ideally, the number of iterations in advance
that array elements are prefetched is such that by the time the array element is actually required by the processor, the
array element is retrieved from memory and placed in the data cache (if it was not there to begin with).

In prior art approaches to prefetching, cache alignment is a problem. Another known problem is the overhead of the
prefetch instruction itself. These are very important problems. Run time array dimensioning is yet another problem that
must be addressed.

For example, in Fig. 3 a loop is shown that has a loop execution time of 10 cycles and that iterates 100 times,
accessing an 8-byte array element on each iteration. If there are no cache misses, the total loop execution time is 1000
cycles. In Fig. 4, a direct mapped data cache is shown where the cache line size is 32 bytes, each line capable of
holding 4 contiguous 8-byte array elements. For the loop of Fig. 3, it is assumed that a cache miss occurs every fourth
iteration (on every cache line crossing), which means that 25 data cache misses will occur for the whole loop. If it takes
40 cycles to service each cache miss, the total loop execution time becomes 2000 cycles, i.e. 1000 cycles for just
executing the loop instructions + 25 X 40 cycles, or another 1000 cycles, for the cache misses.
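The count of 25 misses follows from how many 32-byte lines the 800-byte array spans. The sketch below is an illustration only: the function name is hypothetical, and it assumes that each first touch of a line misses and that no line is evicted and touched again. It also shows how the count depends on where the array starts relative to a line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct cache lines touched by a unit-stride walk over
 * `count` elements of `elem_size` bytes starting at byte address `base`.
 * Hypothetical helper; under the stated assumptions each touched line
 * corresponds to exactly one data cache miss. */
size_t lines_touched(size_t base, size_t count, size_t elem_size,
                     size_t line_size)
{
    assert(elem_size > 0 && line_size > 0);
    if (count == 0)
        return 0;

    size_t first_line = base / line_size;
    size_t last_line = (base + count * elem_size - 1) / line_size;

    /* The walk is contiguous, so every line in between is touched too. */
    return last_line - first_line + 1;
}
```

For 100 8-byte elements and 32-byte lines, an array that starts on a line boundary spans 25 lines, while one that starts 8 bytes into a line spans 26. This is the kind of alignment effect that a prefetch strategy has to tolerate.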

If known prefetch techniques are used, then for the example of Figs. 3 and 4 cache misses can be covered if a
prefetch distance of four is chosen. In Fig. 5, a prefetch instruction is shown inserted into the loop of Fig. 3. As can be
seen, the use of a prefetch instruction can eliminate most cache misses, thereby saving significant execution time.
However, a prefetch instruction requires execution time. In the example herein, each iteration of the loop requires a
prefetch instruction, which can be assumed to take an extra cycle. Therefore, for a loop that iterates 100 times, 100
cycles must be added to the execution time to account for prefetching.
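At source level, a loop in the spirit of Fig. 5 might look like the sketch below. This is not the compiler's output, which operates on machine instructions. It uses the GCC/Clang builtin `__builtin_prefetch` as a stand-in for the target's prefetch instruction, and the names `PREFETCH`, `PFID`, and `sum_with_prefetch` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a hardware prefetch instruction (assumption: GCC or Clang). */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((const void *)(addr))
#else
#define PREFETCH(addr) ((void)(addr)) /* no-op where the builtin is missing */
#endif

#define PFID 4 /* prefetch iteration distance used in the example above */

/* Sums n doubles, issuing one prefetch per iteration for the element that
 * will be read PFID iterations later. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* The address is formed with integer arithmetic because, near the
         * end of the loop, it lies past the array. It is only a hint and
         * is never dereferenced by the program. */
        PREFETCH((uintptr_t)a + (i + PFID) * sizeof *a);
        sum += a[i];
    }
    return sum;
}
```

No bounds check guards the prefetch, which relies on the assumed property that a prefetch of an invalid address does not raise an exception.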

Additionally, the first iteration of the loop incurs a cache miss, which in the example herein requires 40 cycles.
Accordingly, prefetching avoids most cache misses, such that execution time is reduced to 1140 cycles, i.e. 1000 cycles
to execute the original loop instructions + 100 cycles for the prefetch instructions + 40 cycles for the initial cache miss
before the first prefetch instruction is executed. Thereafter, the prefetch instructions overlap the 40-cycle data cache
miss service time with the execution of four (11-cycle) loop iterations.

Unfortunately, each time through the loop a new prefetch instruction is executed. Where the unit of transfer between
the main memory and the cache is a cache line, some of the prefetches are redundant because a prefetch for a
particular array location may refer to the same cache line as the prefetch for subsequent array locations. This redundancy
occurs because there are adjacent array locations in the same cache line, and the system is issuing a redundant
instruction to the memory system to retrieve the same cache line multiple times. Typically, computer systems that
support this type of prefetch instruction track the instructions to determine if a requested address to prefetch a cache
line matches a later prefetch to the same cache line. In such event, the second prefetch request to main memory is
dropped.

However, even though redundant prefetches typically get dropped, it is nonetheless important that prefetch
instructions that refer to the same cache line are not executed multiple times because the prefetch instruction itself
takes up some compute time. The processor must fetch and execute the prefetch instruction, understand what data
address the instruction refers to, and then access the data cache to check if the data are already cache-resident.
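The scale of this redundancy can be estimated by counting how many prefetches name the same cache line as the prefetch issued just before them. The following sketch is illustrative only: the function name and parameters are hypothetical, and it models a unit-stride loop that issues one prefetch per iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts prefetches that name the same cache line as the previous one in
 * a unit-stride loop issuing one prefetch per iteration. Addresses only
 * increase, so comparing against the previous line is sufficient.
 * Hypothetical helper for illustration. */
size_t redundant_prefetches(size_t base, size_t iters, size_t elem_size,
                            size_t line_size, size_t pfid)
{
    assert(elem_size > 0 && line_size > 0);
    size_t redundant = 0;
    size_t prev_line = 0;

    for (size_t i = 0; i < iters; i++) {
        /* Cache line named by the prefetch issued in iteration i. */
        size_t line = (base + (i + pfid) * elem_size) / line_size;
        if (i > 0 && line == prev_line)
            redundant++;
        prev_line = line;
    }
    return redundant;
}
```

For the running example (100 iterations, 8-byte elements, 32-byte lines, prefetch distance of four, line-aligned array), 75 of the 100 prefetches are redundant, so roughly three out of every four prefetch cycles buy nothing.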

Note that the compiler is responsible for inserting prefetch instructions into a loop body that specify the memory
address of data items that will be accessed in the future. The memory address is determined based on the number of
loop iterations in advance (i.e. the prefetch iteration distance or PFID) that data items need to be prefetched to fully hide
the time required to service potential data cache misses. The PFID is determined taking into account the nature of the
loop body instructions and characteristics of the target processor and memory system. For instance, for a "short" loop,
e.g. one that takes only two cycles per iteration to execute, the PFID would need to be 50 in order to accommodate a
100-cycle data cache miss latency.
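The relationship in this example can be written as a ceiling division of the miss latency by the loop iteration latency. This is a simplification assumed for illustration: the text states that the PFID also depends on the nature of the loop body and on processor and memory characteristics, and the function name below is hypothetical.

```c
#include <assert.h>

/* Smallest number of iterations whose combined latency covers the miss
 * latency. Hypothetical helper; a simplified model, not the full
 * computation performed by the compiler. */
unsigned prefetch_iteration_distance(unsigned miss_latency,
                                     unsigned cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    /* Ceiling division without floating point. */
    return (miss_latency + cycles_per_iter - 1) / cycles_per_iter;
}
```

Under this model a 2-cycle loop with a 100-cycle miss latency needs a distance of 50, and an 11-cycle loop with a 40-cycle miss latency needs a distance of 4, matching the figures used in the text.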

The key to efficient data prefetching then is to overlap the computer's execution of the instructions in a piece of
code, such as a loop, with the time it takes to retrieve the data from the memory and place it in the processor cache,
and do this in a way that avoids redundant prefetches.

Ideally, cache miss overhead is completely eliminated by inserting prefetch instructions judiciously. Referring back
to the example above where the loop executes 100 iterations, with each iteration taking 11 cycles (10 cycles for
the original loop body instructions + 1 cycle for the prefetch instruction), and adding 40 cycles for an initial cache miss
before prefetching starts, the time it takes to run the loop is only 1140 cycles, which is much better than the 2000 cycles
of the example above in Fig. 4.

However, 1140 cycles is still not quite as good as 1000 cycles. One way to increase further the savings in processor
execution time is by using a well known technique referred to as loop unrolling. In loop unrolling, the body of the loop is
replicated. This reduces the number of times the loop is executed by a factor that is equal to the number of replications,
although each time the code is executed there are more instructions to execute. Thus, in loop unrolling exactly the same
amount of work is accomplished, but the loop is now reorganized.

Note, however, that because the loop-closing branch does not have to be replicated as many times as the loop
body is unrolled (in fact an unrolled loop typically needs just one loop-closing branch), loop unrolling can by itself result
in improved performance.

For example, Fig. 6 shows the loop of Fig. 3 after the loop has been unrolled four times. Thus, instead of executing
the loop 100 times, the loop is executed 25 times. Assume that a loop-closing branch takes 1 cycle to execute. Each
iteration in the unrolled loop would then require 37 cycles (4 X 9 cycles + 1 cycle) and the total loop execution time is
equal to (25 iterations X 37 cycles) + (25 iterations X 40 cycles/cache miss) = 1925 cycles.

In the context of the invention and per the example above, if each iteration of the unrolled loop requires 37 cycles,
where the loop is unrolled four times, it is necessary to prefetch data two iterations ahead (since 1 iteration ahead is
insufficient to accommodate a 40 cycle cache miss latency). If the prefetch instruction is put at the bottom of the loop,
then the loop is executed before a prefetch is performed. This does not provide optimum operation of the loop. Thus,
the placement of the prefetch instruction is critical. It is therefore necessary to place the prefetch instruction at a point
that provides sufficient time for a prefetch before the loop completes execution. For example, if the prefetch is placed at
the top of the loop, then the loop does the same amount of work, but more effectively overlaps the time to service a
possible data cache miss for subsequent iterations with the computation performed in the current iteration.
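A source-level sketch of an unrolled loop with a single prefetch at the top of the body, in the spirit of Fig. 7, is given below. It is an illustration under stated assumptions: `__builtin_prefetch` (GCC/Clang) stands in for the hardware prefetch instruction, elements are 8-byte doubles, lines are 32 bytes, and the names `PREFETCH`, `LINE_BYTES`, `UNROLL`, `PFID`, and `sum_unrolled_prefetch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a hardware prefetch instruction (assumption: GCC or Clang). */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((const void *)(addr))
#else
#define PREFETCH(addr) ((void)(addr)) /* no-op where the builtin is missing */
#endif

#define LINE_BYTES 32 /* cache line size of the Fig. 4 example */
#define UNROLL 4      /* four 8-byte elements cover one cache line */
#define PFID 2        /* unrolled iterations to prefetch ahead */

double sum_unrolled_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    size_t i = 0;

    for (; i + UNROLL <= n; i += UNROLL) {
        /* One prefetch per unrolled iteration, issued first so that the
         * miss service time overlaps the work below. The address may lie
         * past the array; it is a hint and is never dereferenced. */
        PREFETCH((uintptr_t)&a[i] + PFID * LINE_BYTES);
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++) /* leftover elements when n is not a multiple of 4 */
        sum += a[i];
    return sum;
}
```

Because each unrolled iteration advances by exactly one line's worth of data, the single prefetch names a new cache line every iteration, which removes the redundant prefetches of the rolled loop.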

For the example above, where there are 100 iterations of a 10-cycle loop that takes a total of 1000 cycles, the
prefetch instructions cost 100 cycles + a 40 cycle cache miss for the first iteration. As a result, the execution of the loop
is reduced from 2000 cycles to 1140 cycles. By adding loop unrolling, in this example where the loop is unrolled by a
factor of four (see Figs. 6 and 7), each iteration of the loop may take 38 cycles (37 cycles + 1 cycle for the prefetch
instruction). Thus, execution time for the loop is equal to 38 cycles X 25 iterations + 80 cycles for two cache misses
before prefetching begins = 1030 cycles. Thus, it is clear that the techniques disclosed herein produce a substantial
improvement in the execution time for a loop.
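The four execution times quoted in this discussion all follow one simple cost model: instruction cycles plus the cycles of the misses that are not hidden. The sketch below restates that model. The function names are hypothetical, and the model ignores effects such as superscalar issue.

```c
#include <assert.h>

/* Total cycles = cycles spent executing loop instructions
 *              + cycles spent waiting on misses that are not overlapped.
 * Hypothetical helper restating the arithmetic of the worked example. */
unsigned long total_cycles(unsigned long iters, unsigned long cycles_per_iter,
                           unsigned long exposed_misses,
                           unsigned long miss_penalty)
{
    return iters * cycles_per_iter + exposed_misses * miss_penalty;
}

/* Cycles saved by a transformation that does not slow the loop down. */
unsigned long cycles_saved(unsigned long before, unsigned long after)
{
    assert(after <= before);
    return before - after;
}
```

Plugging in the numbers from the text: the original loop is `total_cycles(100, 10, 25, 40)` = 2000, prefetching alone is `total_cycles(100, 11, 1, 40)` = 1140, unrolling alone is `total_cycles(25, 37, 25, 40)` = 1925, and unrolling with prefetching is `total_cycles(25, 38, 2, 40)` = 1030.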

Note that in some cases the prefetch instruction may not cost any additional cycles to execute. This is because
many modern processors are superscalar, i.e. they can execute multiple instructions in one cycle, e.g. a load with an
add. Prefetch instructions are similar to a load because they refer to memory. Thus, if there are several adds in a loop,
adding one extra prefetch does not increase the time necessary to execute each iteration of the loop because the
prefetch instruction is executed in parallel with the add instruction.

One important feature of the invention identifies loops and access patterns to allow a determination of how many
cycles are devoted to loop iterations, and therefore allows insertion of the prefetch instruction to a location of an array
that is sufficiently far in advance to make sure that the miss time is minimized. One problem is that loops can be coded
in many different ways. It is therefore necessary to recognize different types of loops. For example, there are some
loops that are not always handled by prefetching.

In the invention, the compiler translates the higher level application into an instruction stream that the processor
executes, where the compiler inserts prefetches at opportune points into the instruction stream that effects data
[illegible] in advance of when that data item is actually needed. The compiler [illegible] item is going to be needed at a
particular time, rather than letting the processor [illegible]. One advantage in letting the processor continue executing
the other instructions which may have nothing to do with [illegible] an overlap. While the access time between
processor and cache is typically 1 to 5 cycles, the retrieval time from cache to memory is often on the order of 10 to 100
cycles. When the processor actually [illegible] the data item is needed, it is not necessary to wait for a cache miss that
takes 100 processor [illegible] 100 processor cycles. It may only be necessary to wait for 20 cycles because 80 cycles
worth of look up time is hidden or overlapped with the previous execution.

[illegible] in the low level optimizer of the compiler to insert prefetch instructions at opportune points in the code.
In particular, the invention inserts prefetch instructions into loops. One advantage of [illegible] instructions into loops
is that the data reference pattern of a loop tends to be regular and the compiler is better able to predict the kind of
memory items that are likely to be required in the future, where the future is not the [illegible] five or six iterations in
the future. As discussed above, this depends on the characteristics [illegible] the loop. It is therefore necessary to vary
the [illegible] the prefetch is actually issued, based on the expected latency of a loop iteration.

[illegible] is a piece of software that translates source code, such as C, BASIC, or FORTRAN, into a binary
image that actually runs on a machine. Typically the compiler consists of multiple distinct phases, as discussed above
[illegible] is referred to as the front end, and is responsible for checking the syntactic correctness of the source code.
If the compiler is a C compiler, it is necessary to make sure that the code is legal C code. [illegible] phase, and the
interface between the front-end and the code generator is a high level intermediate representation. The high level
intermediate representation is a more refined series of [illegible] need to be carried out. For instance, a loop might be
coded at the source level as:

    for (I = 0; I < 10; I = I + 1)

[illegible] be broken down into a series of steps, e.g. each time through the loop, first load up I and check it
[illegible] representation and transforms it into a low level intermediate representation. This is much closer to the
actual instructions that [illegible] improving quality of the intermediate representations, a low level intermediate
representation generated by a code generator is typically fed into a low level optimizer.

An optimizer component of a compiler must preserve the program semantics (i.e. the meaning of the instructions
that are translated from source code to a high level intermediate representation, and thence to a low level intermediate
representation), while transforming the code in a way that allows the computer to execute an equivalent set of
instructions in less time.
Modern compilers are structured with a high level optimizer (HLO) that typically operates on a high level intermediate
representation and substitutes in its place a more efficient high level intermediate representation of a particular
program that is typically shorter. For example, an HLO might eliminate redundant computations.

With the low level optimizer (LLO), the over-arching objectives are largely the same as the HLO, except that the
LLO operates on a representation of the program that is much closer to what the machine actually understands.

The invention performs program analysis and prefetch instruction generation in the context of a low level optimizer.
At this level there are not any semantic constructs, merely instructions, such as add, load, and store. The
invention works on such instruction sequences, inserting each prefetch instruction in the context of a low level optimizer.

The analysis that the compiler herein uses is simpler than that of the prior art. Additionally, because the invention
operates in the context of a low level optimizer on raw instructions, it is much easier to estimate how many processor
cycles a loop iteration is likely to take, and therefore how many iterations in advance a data prefetch instruction should be
inserted in the code.

There are many different organizations that are possible for a data cache, but one possible organization that is not
uncommon can be described in terms of a series of direct-mapped cache lines. Each cache line is able to
hold up to 32 bytes of data, such that the unit of transfer between main memory and the cache is in chunks of 32 bytes.
The processor sends a request to the memory system, and the returned data is placed into a well defined location
in the cache, from which the processor retrieves it. The cache in this example is 32,000 bytes in size,
with each cache line being 32 bytes. A compiler that performs data prefetching ultimately has to insert the
hardware prefetch instructions into the low level code representation. One distinguishing feature of the current invention
is that the analysis required to insert prefetch instructions efficiently is also done in the context of a low level optimizer.
Moreover, the prefetch instruction insertion is done in a manner that is synergistic with other low level optimizations, such
as loop unrolling, register reassociation, and instruction scheduling.

The invention herein resides within the domain of the low level optimizer 24. Fig. 8 is a block diagram showing a
low level optimizer for a compiler, including a prefetch driver 34 according to the invention.

The low level optimizer 24 in accordance with the preferred embodiment of the invention may include any combination
of known optimization techniques, such as those that provide for local optimization 35, global optimization,
loop identification 37, loop invariant code motion 38, loop unrolling 30, register reassociation 31, and instruction
scheduling. The invention provides a prefetch driver 34 that operates in concert with such known techniques.

The following pertains to the various elements of the low level optimizer shown on Fig. 8.

- Local optimizations include code improving transformations that are applied on a basic block by basic block basis.
For purposes of the discussion herein, a basic block corresponds to the longest contiguous sequence of instructions
without any incoming or outgoing control transfers, excluding function calls. Examples of local optimizations
include local common sub-expression elimination (CSE), local redundant load elimination, and peephole
optimization.

- Global optimizations include code improving transformations that are applied based on an analysis that spans
basic block boundaries. Examples include global common sub-expression elimination, dead code elimination, and
register promotion that replaces loads and stores with register references.

- Loop identification is the process of identifying sections of code that get executed repetitively (typically this is done
through interval analysis).

- Loop invariant code motion is the identification of instructions located within a loop that compute the same result on
every loop iteration and the re-positioning of such instructions outside the loop body.

- Register allocation and instruction scheduling is the process of assigning hardware registers to symbolic instruction
operands and the re-ordering of instructions to minimize run-time pipeline stalls.

Fig. 9 is a block diagram of a prefetch driver according to the invention. In the figure, the prefetch driver operates
within the low level optimizer. The prefetch driver estimates the prefetch distance 93 and then partitions the memory
references occurring in each loop into equivalence classes. The output of the prefetch driver is
the low level intermediate representation and prefetches that have been inserted into the intermediate representation
in accordance with the invention herein.

The following detailed description pertains to the various modules shown on Fig. 9. 

1. Loop body analysis (see Fig. 10, on which the letters G and H correspond to the same letters on Fig. 9):

a. Identify region constants 190. These are pseudo-registers (symbolic instruction operands) that are
only used and not defined in the loop body. For the purposes of prefetching analysis, only integer region
constants are of importance.

b. Identify simple basic loop induction variables 191. A simple basic induction variable (BIV) is a
pseudo-register whose loop body definitions can all be expressed in the form:

biv = biv + biv_delta

where "biv_delta" is an arbitrary arithmetic expression involving only pure or region constants.

In the example below, the variable k is a region constant, and the variable i is a BIV with a single loop body
definition whose "biv_delta" term corresponds to (2 * k):

k = ...
i = ...
loop
    i = i + (2*k)
end_loop



The net loop increment for a BIV is the total amount by which the BIV is incremented on every loop
iteration.

A BIV is said to have a well-defined loop increment if the total amount by which the BIV is incremented is
the same on every loop iteration. The BIV loop increment in this case is simply the sum of the "biv_delta" values
associated with each of its loop body definitions.

Note that a BIV with conditional loop body definitions does not have a well-defined loop increment. 

c. Compute and linearize address expressions for memory references 192. This involves first identifying the
address expression associated with memory references. Typically this is done by recursively tracing back the
reaching definitions for the register operands of base-relative and indexed loads and stores that appear in the
loop body, and constructing a binary address expression tree, where the internal tree nodes represent simple
arithmetic operations (+, -, *) and the leaf nodes represent either a pure constant, a region constant, or a BIV.
The traceback terminates unsuccessfully when a non-BIV register operand has multiple reaching definitions or
when the address expression can not be expressed as a simple binary expression tree for any other reason.

Memory references whose memory address can not be expressed as such a binary expression tree are not
further considered for prefetching purposes.

The address expression tree is then linearized, if possible, with respect to a unique BIV, meaning that it is
re-written into the form:

a_exp * BIV + b_exp

where "a_exp" and "b_exp" are themselves arithmetic expressions involving just sum terms each of which is a
product involving either literal or region integer constants. The "BIV" term refers to the value of the basic induction
variable at the top of the loop entry basic block (the basic block that is the target of the branch representing the back
edge of the loop).

An address expression that can be linearized in this manner is considered to be "affine". The "a_exp" term of
an affine address expression multiplied by the BIV's net loop increment is also referred to as the memory stride.
Also, associated with each such memory reference is a memory data size that can be inferred from the memory
reference opcode, e.g. a full-word load would be considered to have a data size of 4 bytes.

Memory references with affine address expressions involving a BIV with a well-defined net loop increment that
is a compile-time constant and whose "a_exp" term is non-zero are the only memory references that are further
analyzed for data prefetching purposes.

In the example below containing indexed references to 4-byte integer arrays, A, B, C, and D, 



loop
    ... A[i + 4] ...
    ... B[2*i - 2*k + 8] ...
    ... C[D[i]] ...
    i = i + 1
end_loop

the variable i is a BIV and the address expressions associated with the references to A, B, and D would be
considered affine, since their memory addresses can be expressed as:

(4)*i + (16 + &A[0])

(8)*i + (32 - 8*k + &B[0])

(4)*i + (&D[0])

respectively, where the notation "&X[0]" refers to the region constant variable that represents the address of the
zero'th element of array X.

Note, however that the address expression associated with the reference to array C is not considered affine. 
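
The linearization of the three affine references can be checked with a short sketch. This is only an illustration under stated assumptions: the function name `linearize` and the representation of a subscript as a mapping from symbol to integer coefficient (with "1" standing for the literal term) are hypothetical and do not come from the source.

```python
def linearize(index_terms, elem_size, base_symbol, biv="i"):
    """Split base + elem_size * subscript into (a_exp, b_exp).

    index_terms maps a symbol name to its integer coefficient in the
    subscript; the key "1" holds the literal constant term.
    b_exp is returned as a symbol -> coefficient mapping (hypothetical form).
    """
    # Coefficient of the BIV after scaling by the element size.
    a_exp = elem_size * index_terms.get(biv, 0)
    # Everything else, plus the array base address, goes into b_exp.
    b_exp = {base_symbol: 1}
    for symbol, coeff in index_terms.items():
        if symbol != biv:
            b_exp[symbol] = b_exp.get(symbol, 0) + elem_size * coeff
    return a_exp, b_exp


# 4-byte integer arrays, as in the example above.
ref_a = linearize({"i": 1, "1": 4}, 4, "&A[0]")           # A[i + 4]
ref_b = linearize({"i": 2, "k": -2, "1": 8}, 4, "&B[0]")  # B[2*i - 2*k + 8]
ref_d = linearize({"i": 1}, 4, "&D[0]")                   # D[i]
```

The reference C[D[i]] has no such form, because its subscript is a loaded value rather than a sum of constants and the BIV.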
2. Unroll the loop if possible: 

a. Compute maximum prefetch unroll factor U. The objective here is to determine the largest unroll factor U that 
can be used to minimize prefetch instruction overhead without causing memory strides that are less than or 
equal to the data cache line size to exceed the data cache line size. 

The maximum prefetch unroll factor, U is computed as follows: 
U = loop unroll factor computed by other criteria (e.g. loop body size, expected trip count, trip count divisibility 
etc.) 

For each affine address expression reference associated with a BIV with a well-defined constant net loop 
increment "net_loop_delta" do 

{ 

memory_stride = a_exp * net_loop_delta

if (memory_stride is a compile-time constant && 
ABS(memory_stride) <= cache_line_size) 

{ 

u = cache_line_size / (ABS(memory_stride)) 
U = minimum (U, u) 

} 

} 



b. If U > 1, then unroll loop body U times and repeat loop analysis (steps a-c) on unrolled loop body. 
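
Step 2a can be sketched as follows. The function name and the convention of passing `None` for a stride that is not a compile-time constant are assumptions made for this illustration only.

```python
def max_prefetch_unroll_factor(initial_unroll, memory_strides, cache_line_size):
    """Clamp an unroll factor so small strides do not outgrow a cache line.

    initial_unroll  -- unroll factor chosen by other criteria
    memory_strides  -- one stride in bytes per affine reference, or None
                       when the stride is not a compile-time constant
    """
    unroll = initial_unroll
    for stride in memory_strides:
        if stride is None:
            continue  # not a compile-time constant: no constraint
        # Strides are non-zero by construction (a_exp is non-zero);
        # the lower bound below only guards the division.
        if 0 < abs(stride) <= cache_line_size:
            unroll = min(unroll, cache_line_size // abs(stride))
    return unroll


# 32-byte lines; strides of 4 and 8 bytes allow at most 8 and 4 copies.
factor = max_prefetch_unroll_factor(8, [4, 8], 32)
```

A stride larger than the cache line places no limit on U, since such a reference already touches a new line on every iteration.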

3. Estimate the minimum required prefetch iteration distance (PFID): 

The prefetch iteration distance is the number of loop iterations in advance that data should be prefetched to 
have the data be available in the cache when it is needed by the processor, assuming the data was not in the cache 
to begin with. The PFID is computed based on the expected cache miss latency and the minimum resource-con- 
strained latency for each loop iteration as follows: 

PFID = ceiling (avg_miss_latency / avg_loop_iteration_latency)
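
As a worked instance of this formula, the sketch below evaluates it for a 50-cycle miss latency. The loop latencies of 5 and 8 cycles are illustrative values chosen here, not measurements from the source.

```python
import math


def prefetch_iteration_distance(avg_miss_latency, avg_loop_iteration_latency):
    """PFID = ceiling(avg_miss_latency / avg_loop_iteration_latency)."""
    return math.ceil(avg_miss_latency / avg_loop_iteration_latency)


pfid_fast_loop = prefetch_iteration_distance(50, 5)  # short iterations
pfid_slow_loop = prefetch_iteration_distance(50, 8)  # longer iterations
```

The shorter the loop iteration, the further ahead the prefetch has to be issued to cover the same miss latency.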

There are two competing constraints with regard to the PFID choice: 

First, the PFID should be sufficiently large to hide the expected average cache miss latency. 

Secondly, the PFID should not be so large that the prefetched data is displaced from the cache by an interven- 
ing colliding memory reference before it is actually referenced. 

It is difficult to determine the optimum average expected memory latency. While the best-case round-trip memory
access latency on one system may be, for example, about 50 cycles, it is likely to be different on another system
that uses a slower bus. Furthermore, bus contention and memory bank conflicts tend to increase the memory
access latency.

Nevertheless, the average miss latency is heuristically estimated as the minimum number of processor cycles 
that elapse between the time a request is sent by the processor to the data cache and the time the data is for- 
warded to the processor, assuming the data was not present in the cache. 

Estimating the average loop iteration latency is even harder to do, even for single-basic block loops. Until 
scheduling and register allocation are performed, it is not possible to know for sure how many cycles a loop iteration 
is going to take. Because it is expensive and difficult to compute the achievable loop iteration latency precisely, a 
lower bound on the achievable loop iteration latency based on machine resource usage is computed instead. This 
is quite effective for superscalar processors that execute instructions out-of-order and are able to overlap operation 
latencies at run-time. Typically for such machines, instruction retirement bandwidth constrains the execution cycle 
count the most. Thus, by focusing on the retirement bandwidth requirements of the instructions present in the loop
body, a lower bound on the achievable loop iteration latency can be computed. 

Certain instructions that are likely to be eventually deleted should be ignored in computing the loop iteration
latency estimate. These may include register-to-register move instructions, subscript instructions that may be
eliminated by reassociation, and floating-point multiplies and adds that may be fused into floating-point
multiply-and-accumulate instructions.

For instance, suppose a target out-of-order processor can retire two memory instructions and two ALU or
floating-point operations per cycle and suppose the loop body code consists of:

5 memory operations,

6 ALU operations and

7 floating-point operations

and that three of the ALU operations participate in addressing expressions that are likely to be eliminated through
register reassociation. The lower-bound on the loop iteration latency would then be 5 cycles, computed as the larger of 5/2
and ((6-3) + 7)/2.
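
The arithmetic of this example can be reproduced with a small sketch. The function and parameter names are hypothetical; the retirement widths default to the two-per-cycle figures assumed in the example.

```python
from fractions import Fraction


def loop_latency_lower_bound(mem_ops, alu_ops, fp_ops, removable_alu_ops=0,
                             mem_per_cycle=2, alu_fp_per_cycle=2):
    """Lower bound on cycles per iteration from retirement bandwidth alone."""
    # Cycles needed just to retire the memory operations.
    mem_cycles = Fraction(mem_ops, mem_per_cycle)
    # ALU and floating-point operations share one retirement class here;
    # operations expected to be deleted later are not counted.
    other_ops = (alu_ops - removable_alu_ops) + fp_ops
    other_cycles = Fraction(other_ops, alu_fp_per_cycle)
    return max(mem_cycles, other_cycles)


bound = loop_latency_lower_bound(mem_ops=5, alu_ops=6, fp_ops=7,
                                 removable_alu_ops=3)
```

Exact fractions are used so that a bound such as 5/2 is not rounded before the comparison.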

Now, it is also necessary to address the issue of loops that have internal branches. The minimum loop iteration
latency for such loops is estimated by using previously collected execution profile information, which indicates the
execution count for each basic block in the loop body. The minimum cycle count for each basic block is computed
based on the retirement constraints for the instruction mix within the basic block.

The minimum cycle count is summed over each basic block that is executed more than half as many times as 
the loop entry node to yield an estimate for the minimum loop iteration latency. 
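
A minimal sketch of this profile-based estimate follows. The block counts and cycle counts are made-up illustrative numbers, and the tuple layout is an assumption of this sketch.

```python
def profiled_loop_latency(blocks, entry_count):
    """Sum the minimum cycle counts of the frequently executed blocks.

    blocks      -- list of (execution_count, min_cycle_count) per basic block
    entry_count -- execution count of the loop entry node
    """
    return sum(cycles for count, cycles in blocks if count > entry_count / 2)


# Entry block, a hot path block, and a rarely taken block.
latency = profiled_loop_latency([(1000, 3), (900, 4), (100, 6)],
                                entry_count=1000)
```

The rarely executed block is left out, so it does not inflate the latency estimate and thereby shrink the prefetch iteration distance.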

4. Identify equivalence classes:

To decide the sort of explicit prefetch instructions to insert into the loop body, uniformly-generated equivalence
classes of memory references are first identified. These are basically disjoint sets of memory references whose
address expressions are known to differ by a compile-time constant. This is done to help clearly detect group spatial
and group temporal locality among the different memory references, which in turn can help reduce the prefetch
instruction overhead.

Place each affine address expression associated with a BIV with a compile-time constant net loop increment
in a distinct group such that all address expressions within a group share the following properties:

- they are all associated with the same BIV
- they all have the same "a_exp" term
- their "b_exp" terms differ by a compile-time constant

The following algorithm is used to do this: 

- let the set of uniformly generated equivalence classes, UGEC = { }

- add each affine address expression E(biv, a_exp, b_exp) to a work list W

repeat
{
    - remove an address expression Ei(biv, a_exp, b_exp) from the work list W

    - compute the memory stride M for Ei as (a_exp * net biv loop increment)

    - if M is not a compile-time constant, then substitute some fixed constant C for
      each non-compile-time constant pseudo-register P that occurs in M, and
      compute the constant-folded memory stride M'

      Also, replace with C occurrences in "b_exp" of any non-compile-time constant
      pseudo-register P that occurs in M, and constant-fold "b_exp" to yield b_exp'

    - if M is a compile-time constant, then let M' = M and b_exp' = b_exp

    for each existing equivalence class Q in UGEC
    {
        - choose any representative address expression
          Er(biv, a_exp, b_exp') belonging to Q

        - if biv and a_exp of Er and Ei are not identical, move on to the next
          equivalence class

        - symbolically subtract the b_exp' expression for Ei from Er to obtain S

        - if S is a non-zero compile-time constant, add Ei to equivalence class Q
          and move on to consider next address expression on the work list W
    }

    - if Ei was not added to any existing equivalence class, then add a new
      equivalence class X to UGEC and add Ei to X. Also associate M' with the
      newly created equivalence class X
}
until work list W is empty
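
The partitioning loop can be sketched in a few lines. Several simplifications are assumed here and are not from the source: every stride is already a compile-time constant (so the substitution of a fixed constant C is skipped), "b_exp" is a mapping from symbol to integer coefficient with "1" for the literal term, and a zero difference is treated as a match so that duplicates can later be collapsed during sorting.

```python
def symbolic_difference(b1, b2):
    """Subtract two symbol -> coefficient mappings, dropping zero terms."""
    diff = {}
    for sym in set(b1) | set(b2):
        coeff = b1.get(sym, 0) - b2.get(sym, 0)
        if coeff != 0:
            diff[sym] = coeff
    return diff


def is_compile_time_constant(diff):
    """True when only the literal term "1" (or nothing) remains."""
    return all(sym == "1" for sym in diff)


def build_equivalence_classes(exprs):
    """Group (biv, a_exp, b_exp) tuples into uniformly generated classes."""
    classes = []
    for expr in exprs:
        biv, a_exp, b_exp = expr
        for cls in classes:
            rep_biv, rep_a, rep_b = cls[0]  # any member can represent Q
            if (rep_biv, rep_a) != (biv, a_exp):
                continue
            if is_compile_time_constant(symbolic_difference(rep_b, b_exp)):
                cls.append(expr)
                break
        else:
            classes.append([expr])  # no class accepted the expression
    return classes


exprs = [
    ("i", 4, {"&A[0]": 1}),            # A[i]
    ("i", 4, {"&A[0]": 1, "1": 4}),    # A[i+1]
    ("i", 4, {"&B[0]": 1}),            # B[i]
    ("i", 8, {"&A[0]": 1}),            # different a_exp
]
classes = build_equivalence_classes(exprs)
```

References to A[i] and A[i+1] land in one class because their "b_exp" terms differ by the literal 4, while the reference to B differs by an unknown base address and starts its own class.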

Consider each equivalence class, Q, in turn and do the following:

5. Sort the address expressions within each equivalence class based on their b_exp' terms and replace multiple 
address expressions with identical b_exp' terms with a single representative address expression. 

Since by construction, the address expressions belonging to the same equivalence class differ in their b_exp' 
terms by a simple constant, it should always be possible to sort them based on increasing b_exp' values. 

Let "E_low" be the address expression in Q with the lowest b_exp' value. Compute a relative equivalence class
offset, "eq_offset", for each address expression E in Q, as:
E.eq_offset = E.b_exp' - E_low.b_exp'

6. Compute prefetch instructions needed to ensure full cache miss coverage for equivalence class Q. 

The goal here is to insert the fewest number of prefetch instructions in the loop body to ensure that in the
steady-state, a prefetch is issued for every distinct cache line referenced by the address expressions in the
equivalence class Q. Unnecessary prefetches are avoided if possible by exploiting any group-spatial or group-temporal
locality that may be apparent among the memory references within each equivalence class.

The method of determining the fewest number of prefetch instructions needed to ensure full cache miss
coverage depends on the magnitude of the memory stride, M', associated with the equivalence class.

If M' is <= cache line size, then a prefetch strategy suited to small strides is employed, otherwise a prefetch
strategy suited to large strides is used. Note that for large strides, cache line alignment of data elements needs to
be considered.

In either case, it is first necessary to identify clusters of references within the uniformly generated equivalence 
class. A cluster consists of one or more memory references that occur consecutively in the equivalence class list
sorted on "eq_offset", with a well-defined cluster leader. The cluster leader is used to generate prefetch data on 
behalf of all members of the cluster. The objective here is to weed out those refs within an equivalence class that 
trail other refs within the equivalence class. The refs that are still left standing are essentially cluster leaders. 

The manner in which memory references are grouped into clusters depends on the relative size of the memory 
stride as compared to the cache line size. 

a. Cluster identification for small stride equivalence classes. 

It is necessary to consider the address expressions in the equivalence-class in the order of increasing 
"eq_offset" values and determine whether each address expression trails the very next address expression in 
the equivalence class and if so, drop it from the equivalence class. 

Let B(i) and B(i+1) be adjacent memory refs within the sorted equivalence class list. When the memory
stride is <= cache line size, B(i) is considered to be in the same cluster as B(i+1), and therefore omitted from
prefetch consideration iff

| B(i+1).eq_offset - B(i).eq_offset | <= prefetch memory distance

where the prefetch memory distance is computed as the product of PFID and the effective memory stride, M',
for the equivalence class.

The logic behind this is that if B(i+1) leads a reference B(i) by less than the prefetch memory distance, then
there is no real point in inserting a prefetch instruction on behalf of B(i). While some of the initial PFID executions
of B(i) within the loop may suffer cache misses, subsequent executions of B(i) would either find its data in
the cache or have to wait much less than a full cache miss latency for its data to be retrieved from main memory,
since B(i+1) or a prefetch associated with B(i+1) would have initiated the memory retrieval earlier in time.
[This is of course assuming that conflict/capacity misses haven't displaced the data from the cache by the time
B(i) catches up with B(i+1)].

The last PFID loop iterations can be peeled as described in Mowry et al. to avoid the overhead of redundant
prefetch instructions that would be executed for data elements not accessed by the original loop.

As shown on Fig. 9, the module that computes the prefetch instructions needed for an equivalence
class 96 is identified by the letters C and D, which letters are used to indicate a more detailed explanation
of the module, which is shown on Fig. 11. In the figure, the prefetch driver is shown to comprise a module that
computes the prefetches that are needed 96, where small stride prefetch candidates 201 and large stride prefetch
candidates 202 are identified in accordance with a detection module 200. The algorithm for cluster identification
with small strides is given below:

for each address expression "p" in the current equivalence class in sorted order, except the very last address
expression
{
    - let q be the next address expression in the equivalence class

    - if ( ABS( q.eq_offset - p.eq_offset ) <= ABS( M' * PFID ) )
      {
          remove p from equivalence class list
      }
      else
      {
          - mark p as a leader of a cluster
          - let p.trailing_offset = p.leading_offset = p.eq_offset
      }
}

mark the very last address expression p in the equivalence class list as a cluster leader
let p.trailing_offset = p.leading_offset = p.eq_offset

The address expressions remaining in the equivalence class after this weeding out process are all cluster 
leaders. 
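
The small-stride weeding-out pass can be sketched as below. For brevity the sketch works on bare eq_offset values rather than full address expressions, which is an assumption of this illustration.

```python
def small_stride_leaders(eq_offsets, stride, pfid):
    """Return the eq_offsets that survive as cluster leaders.

    A reference is dropped when the next reference in sorted order leads
    it by no more than the prefetch memory distance |M' * PFID|.
    """
    offsets = sorted(eq_offsets)
    pf_memory_distance = abs(stride * pfid)
    leaders = []
    for p, q in zip(offsets, offsets[1:]):
        if abs(q - p) > pf_memory_distance:
            leaders.append(p)  # q is too far ahead to cover p
    if offsets:
        leaders.append(offsets[-1])  # the last reference always leads
    return leaders


# 4-byte stride, PFID of 10: prefetch memory distance is 40 bytes.
leaders = small_stride_leaders([0, 4, 8, 400], stride=4, pfid=10)
```

Here the references at offsets 0 and 4 trail the one at offset 8 closely enough to be covered by it, whereas the reference at offset 400 is too far ahead and needs its own prefetch.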

b. Cluster identification for large stride equivalence classes. 

The algorithm for detecting the fewest number of prefetch candidates needed for an equivalence class with 
a large stride is unfortunately a bit more complicated than the one used for small-stride equivalence classes.
The primary reason for this is that with large memory strides, the relative cache line alignment of the memory 
refs becomes important. For instance, consider the following "C" loop nest: 

int A[100][100];

for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
    {
        ... A[j][i] ...
        ... A[j][i+1] ...
    }



The above source code fragment strides through the array A in large increments for each iteration of the
inner j-loop. It must be determined whether it is sufficient to insert only one prefetch instruction on behalf of
A[j][i+1], with the assumption that the A[j][i] reference is a trailing reference. The answer is no because the two
references could straddle a cache line boundary. If this were the case, then the references to A[j][i] could miss
the cache, possibly on every iteration of the j-loop, even though data is prefetched for the A[j][i+1] reference.

However, this is not to say that there is no hope of sharing prefetch instructions among references within
a uniformly generated equivalence class with a large stride. For instance, if the first reference in the j-loop
above had been to A[j-1][i+1] instead of A[j][i], clearly one prefetch instruction would be sufficient for both
references.

To make the system immune from the vagaries of relative cache line alignment of references within an
equivalence class, yet at the same time exploit obvious temporal locality among the references, a two-pass
strategy is used. This strategy is shown in Fig. 12. In the first pass, it is necessary to identify clusters of adjacent
references within the current equivalence class, which are sorted based on their eq_offsets 205. The
distinguishing feature of each such cluster is that the references within the cluster share group spatial locality but no
group temporal locality.

The leading reference within each such cluster is responsible for prefetching data both for itself and every
other reference in the cluster. To accommodate bad breaks with cache line alignments, a cluster leader may
give rise to multiple prefetch instructions, each spaced a cache line apart from the next, until the entire span of
the cluster is accounted for. To better explain this, consider the following simple loop, which is possibly the
result of the loop unrolling step, where "A" is a double-precision array variable, i.e. with 8-byte elements.

i_loop:
    A[i] =
    A[i+1] =
    A[i+2] =
    A[i+3] =
    A[i+4] =
    A[i+5] =
    A[i+6] =
    A[i+7] =
    i = i + 8
end_i_loop;



First of all, because the loop BIV, i, has a net loop increment of eight, and the element size of "A" is
8 bytes, this is a large stride equivalence class, assuming a 32-byte cache line size ((8 x 8 bytes = 64 bytes) > 32
bytes). All eight references to "A" are placed into the same cluster because they exhibit group spatial locality and
no group temporal locality. The cluster leader is the reference to A[i+7], and the span of the cluster is 64-bytes
(i.e. &A[i+7] - &A[i]). If the prefetch memory distance was computed earlier to be 128-bytes, i.e. corresponding
to a prefetch iteration distance of two, it is only necessary to insert three prefetch instructions to account for the
entire span of this 8-member cluster. These three prefetches essentially prefetch the following array elements:

prefetch A[i+0+16] ; p/f for cluster stragglers
prefetch A[i+3+16] ; p/f for cluster stragglers
prefetch A[i+7+16] ; p/f for cluster leader

Regardless of the cache line alignment of the cluster leader, these three prefetch instructions ensure that
all the cluster members have memory transactions initiated for data that they reference two iterations in
advance. To simplify the actual generation of these types of prefetch instructions, cluster members other than the
leader are removed from the equivalence class right away, and the cluster span for the representative cluster leader is
recorded.
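
One way to derive the three prefetch targets above is sketched here. The placement rule (start at the leader, step back one cache line at a time, and finish with the trailer) is an assumption of this sketch that happens to reproduce the example; the source only states that the prefetches are spaced a cache line apart until the span is covered. A positive stride is assumed, and the byte offsets of A[i] and A[i+7] are taken as 0 and 56.

```python
import math


def cluster_prefetch_offsets(leading_offset, trailing_offset,
                             cache_line_size, pf_memory_distance):
    """Byte offsets (relative to the lowest reference) to prefetch."""
    span = leading_offset - trailing_offset
    lines = math.ceil(span / cache_line_size)
    # One prefetch per cache-line step back from the leader...
    offsets = [leading_offset - k * cache_line_size for k in range(lines)]
    # ...plus one for the trailer, to tolerate worst-case alignment.
    offsets.append(trailing_offset)
    return sorted(off + pf_memory_distance for off in offsets)


# A[i] .. A[i+7] with 8-byte elements, 32-byte lines, 128 bytes ahead.
targets = cluster_prefetch_offsets(56, 0, 32, 128)
elements = [t // 8 for t in targets]  # back to array subscripts past i
```

The number of offsets produced is always ceiling(span / line size) + 1, so a single-reference cluster with a zero span still receives exactly one prefetch.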

b.i. The high level algorithm for the first pass is shown below:

- let q = last address expression in the current equivalence class

- mark q as a cluster leader

- let q.trailing_offset = q.leading_offset = q.eq_offset

for each address expression "p" in the current equivalence class considered in backward sorted order,
ignoring "q"
{
    - compute distance from current cluster leader, dist_l, as follows:

      dist_l = ABS(q.eq_offset - p.eq_offset)

    - compute distance from current cluster trailer, dist_t, as follows:

      dist_t = ABS(q.trailing_offset - p.eq_offset)

    - check the following two conditions to determine if p should be included in the
      cluster led by q:

      a) dist_t <= cache_line_size

      b) dist_l < M'

      if both these conditions are met, then
      {
          - let q.trailing_offset = p.eq_offset

          - remove p from the equivalence class
      }
      else
      {
          - let q = p and mark q as a cluster leader

          - let q.trailing_offset = q.leading_offset = q.eq_offset
      }
}
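
The first pass can be sketched as follows. As a simplification assumed here, the sketch operates on bare eq_offset values and returns each cluster as a (leading_offset, trailing_offset) pair.

```python
def first_pass_clusters(eq_offsets, stride, cache_line_size):
    """Group sorted eq_offsets into clusters, scanning from the highest.

    Returns a list of (leading_offset, trailing_offset) pairs.
    """
    offsets = sorted(eq_offsets)
    if not offsets:
        return []
    clusters = []
    leader = trailer = offsets[-1]
    for p in reversed(offsets[:-1]):
        dist_l = abs(leader - p)   # distance from current cluster leader
        dist_t = abs(trailer - p)  # distance from current cluster trailer
        if dist_t <= cache_line_size and dist_l < abs(stride):
            trailer = p            # p joins the cluster led by `leader`
        else:
            clusters.append((leader, trailer))
            leader = trailer = p   # p starts a new cluster
    clusters.append((leader, trailer))
    return clusters


# The unrolled A[i] .. A[i+7] loop: 64-byte stride, 32-byte lines.
unrolled = first_pass_clusters(range(0, 64, 8), stride=64, cache_line_size=32)
# Two references 100 bytes apart under a 400-byte stride.
split = first_pass_clusters([0, 100], stride=400, cache_line_size=32)
```

In the first call all eight references collapse into a single cluster led by the reference at offset 56; in the second, the gap exceeds a cache line, so each reference leads its own cluster.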



Having identified the cluster leaders in the first pass, in the second pass the algorithm attempts to exploit
temporal locality between clusters (see the module identified by numeric designator 206 on Fig. 12). This
is similar to the strategy used to identify prefetch candidates for small-stride equivalence
classes. If the distance, as measured between a trailing cluster's leader and the leading cluster's trailer,
is less than the prefetch memory distance, the trailing cluster is a candidate to be
removed from further prefetch consideration.

However, rather than simply forgetting about the trailing cluster, it is necessary to merge the trailing cluster with
the leading cluster, because the merged cluster's span is used later on to determine how many prefetch
instructions are actually needed. Note that the span of a merged cluster can not be allowed to exceed the
effective memory stride M', because prefetch instructions would otherwise be needlessly inserted for the merged
cluster's trailing references. Instead, a merged cluster's span is
clamped to be no larger than the effective memory stride.

Another important consideration in deciding to merge clusters is whether the merge would be profitable. In
general, for a cluster C whose span is "C.s" bytes, "C.p" prefetch instructions are inserted. "C.p" can be computed as:

C.p = [ceiling (C.s/L)] + 1

where L is the cache line size. If a cluster C has "C.b" references within it, then unless (C.p <= C.b) there is no
benefit in forming the cluster. This consideration applies to the formation of clusters in the first pass, as well
as to the merging of clusters in the second pass. The first pass test suggests that no two adjacent references
within a cluster should be greater than a cache line apart because, otherwise, it would be better to break the
cluster into two sub-clusters. Furthermore, the span of the cluster, as in the distance between the candidate
cluster trailer and the current cluster leader, must not be greater than the effective memory stride M', as
explained before. A merged cluster's span is therefore
clamped to be no greater than the effective stride for this reason.
In the second pass, when deciding whether to merge clusters, the number of prefetches that would be needed
for the merged cluster is compared against the number needed if the clusters are kept separate.

One subtlety to the cluster merging pass is that it may sometimes seem unprofitable to merge a cluster C(i)
with the next leading cluster C(i + 1), even though in a larger context it would have been profitable to merge C(i) with
C(i + 2), assuming that C(i + 2) and C(i) are less than the prefetch memory distance apart. Thus, non-adjacent
clusters are examined for merging purposes, and the best cluster to merge with is selected.

When it is decided to merge a cluster C(i) with another cluster C(j), the merged cluster leader's relative offset may
differ from the relative offset of C(i)'s leading reference, and the span of a merged cluster may be larger
than that implied by the original relative offset of the cluster leader. The algorithm used
for the second, i.e. cluster merging, pass which is used to exploit temporal locality between clusters is explained
below. The algorithm scans the remaining address expressions in the equivalence class, each of which is a cluster
leader identified in the first pass, linearly forwards and looks for merging opportunities with the other leading
clusters. It is assumed for simplicity that the uniformly generated equivalence class has a
positive stride.

A necessary condition for a trailing cluster C(i) to be eligible for being merged with a leading cluster C(j) is that:

a) [C(j).trailing_offset - C(i).leading_offset] < pf_memory_distance.

Otherwise, C(j) would be leading cluster C(i) by more than is desired.
Now, let:

m = ceiling ((C(j).trailing_offset - C(i).leading_offset) / M')

where "trailing_offset" and "leading_offset" refer to
the eq_offset of the cluster trailer and leader respectively. Note that the trailing cluster may be the result of a
previous merger, so its "trailing_offset" may not correspond to a reference actually appearing in the loop.

Clearly, it is required that 0 < m < PFID.

It is also necessary to define the span of a cluster C(x) to be:

C(x).s = C(x).leading_offset - C(x).trailing_offset

The number of prefetch instructions needed for a distinct cluster C(x) is given by C(x).p:

C(x).p = ceiling (C(x).s / L) + 1

where L is the cache line size.

If cluster C(i) is merged with cluster C(j), then the span of the merged cluster C(j') is given by:

C(j').s = MIN {M', C(j').leading_offset - C(j').trailing_offset}

(the MIN operation clamps the merged cluster size at M')

where C(j').leading_offset =
MAX {C(j).leading_offset, (C(i).leading_offset + (m * M'))}

and C(j').trailing_offset =
MIN {C(j).trailing_offset, (C(i).trailing_offset + (m * M'))}

For the merger of cluster C(i) into cluster C(j) to be profitable in terms of reducing the overall number of
prefetches required, the following is necessary:

C(j').p <= (C(i).p + C(j).p)

which, with C(x).p = ceiling (C(x).s / L) + 1, means that:

ceiling (C(j').s / L) <= (ceiling (C(i).s / L) + ceiling (C(j).s / L) + 1).
Let the savings accruing from merging cluster C(i) with Cffl into a combined duster CO be defined as- - 

merger_savings(i.j.j 1 ) = (C(i).p + C(i).p)-(CGTp. 

A second criterion can now be articulated. For a cluster C(i), of the leading clusters C(j)
that satisfy criterion a, the system chooses to merge cluster C(i) with one of those clusters C(j) for which:

b) merger_savings(i, j, j') is maximized and non-negative.
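The selection rule of criterion b can be illustrated with a short Python sketch. The function names and the candidate tuples are hypothetical, and the spans of the tentative merged clusters are assumed to have been computed already for candidates that satisfy criterion a.

```python
def prefetch_count(span: int, line_size: int) -> int:
    # C(x).p = ceiling(C(x).s / L), using integer arithmetic.
    return -(-span // line_size)


def merger_savings(span_i: int, span_j: int, span_merged: int, line_size: int) -> int:
    # merger_savings(i, j, j') = (C(i).p + C(j).p) - C(j').p
    separate = prefetch_count(span_i, line_size) + prefetch_count(span_j, line_size)
    return separate - prefetch_count(span_merged, line_size)


def pick_merge_target(span_i, candidates, line_size):
    """Return (name, savings) for the candidate with the largest non-negative
    savings, or None if every merger would cost extra prefetches.

    candidates: iterable of (name, span_j, span_merged) tuples, each assumed
    to satisfy criterion a already.
    """
    best = None
    for name, span_j, span_merged in candidates:
        savings = merger_savings(span_i, span_j, span_merged, line_size)
        if savings >= 0 and (best is None or savings > best[1]):
            best = (name, savings)
    return best


# Example with 64-byte cache lines and a trailing cluster spanning 40 bytes.
choice = pick_merge_target(
    40,
    [("C1", 100, 130), ("C2", 60, 64), ("C3", 10, 200)],
    64,
)
```

In this example C2 is chosen: merging with it replaces two prefetches by one, whereas C1 yields no savings and C3 would need more prefetches than leaving the clusters separate.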

While the value of m computed as mentioned above represents the minimum positive integral multiple
of the stride required to achieve an overlap of C(i) with C(j), it is also necessary to check whether using
(m - 1) iterations is preferable. Although projecting cluster C(i) ahead by (m - 1) iterations definitely does
not cause it to overlap with C(j), the projected C(i) leader may lie close to the C(j) trailer, allowing the
system to expend fewer prefetches than if C(i) were projected ahead by one more iteration. This can be
especially true if the stride is much larger than the individual cluster spans.

Thus, the algorithm scans the remaining address expressions, which represent the clusters identified earlier,
in order and attempts to merge each with a selected leading cluster by projecting the trailing cluster
ahead by either m or (m - 1) iterations and checking if both criteria "a" and "b" apply. To project a trailing
cluster C(i) ahead m iterations to evaluate whether it should be merged with a leading cluster C(j), the system
computes tentative "leading_offset" and "trailing_offset" values for the proposed merged cluster C(j') thusly:
let C(j').leading_offset = C(i).leading_offset + (m * M')
let C(j').trailing_offset = C(i).trailing_offset + (m * M')

if (M' > 0)
{
    C(j').leading_offset = MAX (C(j').leading_offset, C(j).leading_offset)
    C(j').trailing_offset = MIN (C(j').trailing_offset, C(j).trailing_offset)
} else
{
    C(j').leading_offset = MIN (C(j').leading_offset, C(j).leading_offset)
    C(j').trailing_offset = MAX (C(j').trailing_offset, C(j).trailing_offset)
}

adjust C(j').trailing_offset if needed to ensure C(j').s does not exceed M' as
follows:

C(j').s = ABS ( C(j').leading_offset - C(j').trailing_offset )
if (C(j').s > ABS(M') )
{
    C(j').trailing_offset = C(j').leading_offset - M'
    C(j').s = ABS(M')
}
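The projection pseudocode above can be transcribed into Python as a minimal sketch. The function name and the use of (leading_offset, trailing_offset) tuples are illustrative assumptions; the arithmetic follows the pseudocode line by line.

```python
def project_and_merge(ci, cj, m, stride):
    """Project trailing cluster ci ahead by m iterations and merge it with cj.

    ci, cj: (leading_offset, trailing_offset) tuples.
    stride: the constant-folded stride M' (may be negative).
    Returns (leading_offset, trailing_offset, span) of the tentative C(j').
    """
    lead = ci[0] + m * stride
    trail = ci[1] + m * stride

    if stride > 0:
        lead = max(lead, cj[0])
        trail = min(trail, cj[1])
    else:
        lead = min(lead, cj[0])
        trail = max(trail, cj[1])

    # Clamp the merged span so that it does not exceed |M'|.
    span = abs(lead - trail)
    if span > abs(stride):
        trail = lead - stride
        span = abs(stride)
    return lead, trail, span


# Projecting by m and by (m - 1) iterations gives different tentative clusters.
merged_m = project_and_merge((40, 0), (1000, 900), 4, 256)
merged_m_minus_1 = project_and_merge((40, 0), (1000, 900), 3, 256)
```

Comparing the spans of the two tentative clusters (and hence their prefetch counts) is what lets the merger step decide between projecting by m or by (m - 1) iterations.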



When a cluster C(i) is successfully merged with a cluster C(j) into a cluster C(j'), C(i) is removed from the
equivalence class list.

7. Generate the prefetch instructions required for each remaining cluster leader. 

It is necessary to consider each cluster leader in turn, and where "trailing_offset" is different than
"leading_offset" for any cluster leader, insert as many prefetches as needed to cover the cluster's entire span, i.e.
from "leading_offset" down to "trailing_offset", each prefetch instruction address spaced L bytes apart.
More specifically, if the memory reference corresponding to a cluster leader is represented by the instruction:

load disp(Rb),Rt

where "disp" is a displacement value and Rb and Rt are pseudo-registers corresponding to the base register and
target register of the load, then one or more prefetch instructions are inserted into the code stream as follows:

prefetch_inst new_disp(Rb)

where "new_disp" is computed as disp + (M*PFID) + pf_disp, where "pf_disp" represents the displacement that
needs to be added to the memory address referenced by the cluster leader, to form the base address from which
it is necessary to prefetch ahead by the prefetch memory distance. The algorithm used to emit the prefetch
instructions is given below:



- let L = cache line size

for each remaining cluster C(i) in the equivalence class
{

- let disp = displacement of memory reference instruction associated with the 
leader address expression for cluster C(i) 

- if (C(i).leading_offset == C(i).trailing_offset) then 
{ 

pf_disp = 0 

- emit prefetch_inst with new_disp = disp + (M * PFID) 

} 

else { 

if (M > 0) 
{ 

- let cur_offset = C(i).leading_offset 

- let final_offset = C(i).trailing_offset 

} else 
{ 

- let cur_offset = C(i).trailing_offset

- let final_offset = C(i).leading_offset 

} 

- let pf_disp = cur_offset - C(i).eq_offset 



while (cur_offset > final_offset) do
{

- emit prefetch_inst with new_disp = disp + (M * PFID) + pf_disp

- let cur_offset = cur_offset - L 

- let pf_disp = pf_disp - L 

} 

- emit one final prefetch to account for the final member of the cluster
(i.e. the "leader" for a negative memory stride, else the "trailer") with
new_disp = disp + (M * PFID) + (final_offset - C(i).eq_offset)

} 

} 
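As an illustration, the emission loop above can be written as a Python function that returns the list of new_disp values instead of emitting instructions. The function and parameter names are hypothetical, and eq_offset is assumed to be the eq_offset of the cluster leader's address expression.

```python
def prefetch_displacements(disp, stride, pfid, line_size, leading, trailing, eq_offset):
    """Return the new_disp value of every prefetch emitted for one cluster.

    stride is the original memory stride M (not the constant-folded M').
    """
    base = disp + stride * pfid

    # A cluster with a single member needs exactly one prefetch (pf_disp = 0).
    if leading == trailing:
        return [base]

    if stride > 0:
        cur_offset, final_offset = leading, trailing
    else:
        cur_offset, final_offset = trailing, leading

    pf_disp = cur_offset - eq_offset
    result = []
    while cur_offset > final_offset:
        result.append(base + pf_disp)
        cur_offset -= line_size
        pf_disp -= line_size

    # One final prefetch covers the last member of the cluster.
    result.append(base + (final_offset - eq_offset))
    return result


# Example: 64-byte lines, M = 256, PFID = 4, cluster spanning 164 bytes.
disps = prefetch_displacements(8, 256, 4, 64, leading=164, trailing=0, eq_offset=164)
```

In this example the prefetches walk down from the leader in steps of one cache line, and the final entry targets the trailer itself.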



Note that in computing new_disp, the original memory stride value M, computed as (a_exp * net_biv_loop_increment),
is used and not the constant folded value M'. This may require materializing a run-time region constant
expression in a register, outside the loop body, and inserting an explicit add instruction within the loop body to form
the prefetch instruction address thusly:

Rm = (a_exp * net_biv_loop_increment) * PFID
loop
    Rx = Rm + Rb
    prefetch_inst new_disp'(Rx)
    load disp(Rb),Rt
end_loop



where new_disp' = disp + pf_disp. If the prefetch instruction supports an addressing mode which causes the
effective memory address to be computed as the sum of two register values, then the add operation may be omitted by
folding the new_disp' value into Rm outside the loop body, yielding Rm', and specifying the Rm' and Rb registers
as operands of the prefetch instruction directly, as shown below:

Rm' = ((a_exp * net_biv_loop_increment) * PFID) + new_disp'
loop
    prefetch_inst Rm'(Rb)
    load disp(Rb),Rt
end_loop



However, if "disp" itself is a run-time value, as opposed to a simple constant, then an explicit add operation is 
unavoidable: 

Rm' = ((a_exp * net_biv_loop_increment) * PFID) + pf_disp
loop
    Rx = Rb + disp
    prefetch_inst Rm'(Rx)
    load disp(Rb),Rt
end_loop



and if the prefetch instruction does not support a register + register addressing mode, then two add operations may
be needed:

Rm = (a_exp * net_biv_loop_increment) * PFID
loop
    Rx1 = Rb + disp
    Rx2 = Rx1 + Rm
    prefetch_inst pf_disp(Rx2)
    load disp(Rb),Rt
end_loop



Note, however, that these new add instructions may be eliminated through register reassociation. In fact, the
prefetch instruction(s) and the beneficiary memory reference instruction may be able to share the same base
register through register reassociation, allowing the add instructions to be deleted:



Rp - initialized to the address of the memory location referenced by the load
     on the first loop iteration

Rm = (a_exp * net_biv_loop_increment)
Rdelta = (Rm * PFID) + pf_disp

loop
    prefetch_inst Rdelta(Rp)
    load 0(Rp),Rt
    Rp = Rp + Rm
end_loop



Furthermore, if the target architecture supports an auto-increment addressing mode (e.g. PA-RISC, IBM
PowerPC), then the increment of the new base register Rp may be folded into the load instruction itself.
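To make the effect of register reassociation concrete, the following Python sketch simulates the two-add loop and the reassociated loop and records the prefetch and load addresses formed on each iteration. It is only an illustration: the function names are hypothetical, registers are modeled as plain integers, and the original loop is assumed to advance Rb by the stride M on every iteration, which the listings above do not show explicitly.

```python
def addresses_two_adds(rb0, disp, stride, pfid, pf_disp, iterations):
    """Loop with two explicit adds: Rx1 = Rb + disp, Rx2 = Rx1 + Rm."""
    rm = stride * pfid
    rb = rb0
    trace = []
    for _ in range(iterations):
        rx1 = rb + disp
        rx2 = rx1 + rm
        prefetch_addr = pf_disp + rx2   # prefetch_inst pf_disp(Rx2)
        load_addr = disp + rb           # load disp(Rb),Rt
        trace.append((prefetch_addr, load_addr))
        rb += stride                    # assumed induction-variable update
    return trace


def addresses_reassociated(rb0, disp, stride, pfid, pf_disp, iterations):
    """Loop after register reassociation: prefetch and load share base Rp."""
    rp = rb0 + disp                     # address referenced on the first iteration
    rm = stride
    rdelta = rm * pfid + pf_disp
    trace = []
    for _ in range(iterations):
        prefetch_addr = rdelta + rp     # prefetch_inst Rdelta(Rp)
        load_addr = 0 + rp              # load 0(Rp),Rt
        trace.append((prefetch_addr, load_addr))
        rp += rm                        # Rp = Rp + Rm
    return trace


before = addresses_two_adds(4096, 8, 256, 4, -64, 5)
after = addresses_reassociated(4096, 8, 256, 4, -64, 5)
```

Both traces are identical, which shows that the reassociated form touches the same addresses while keeping only one add (the base register update) inside the loop.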

In terms of the code placement of the prefetch instruction itself, the prefetch instruction(s) may initially be
placed adjacent to the beneficiary memory reference instruction. Subsequently, the instruction scheduling phase
may re-order the prefetch instruction(s) as needed to improve performance. In doing this, memory dependencies
between the prefetch instruction and other memory references in the loop body may be ignored and, assuming the
prefetch instruction is guaranteed not to raise an exception, it may be freely scheduled across basic blocks as well.

Although the invention is described herein with reference to the preferred embodiment, one skilled in the art 
will readily appreciate that other applications may be substituted for those set forth herein without departing from 
the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims 
included below. 

Claims 

1. A compiler (20), comprising: means (34) in a low level optimizer (24) for analyzing and efficiently inserting
explicit data prefetch instructions into loops of applications.

2. The compiler of Claim 1, further comprising:

subscript expression analysis means (34) for determining data prefetching requirements. 

3. The compiler of either of Claims 1 and 2, wherein analysis and explicit data cache prefetch instruction insertion are 
performed by said compiler (20) in a machine instruction level optimizer (24). 

4. The compiler of any of Claims 1 to 3, further comprising:

means (93) for exploiting execution profiles from previous runs of an application during insertion of prefetch
instructions into innermost loops with internal control flow.

5. The compiler of any of Claims 1 to 4, further comprising: 

means (96) for recognizing cache line reuse patterns across loop iterations to eliminate unnecessary 
prefetch instructions. 

6. The compiler of any of Claims 1 to 5, wherein said prefetch insertion means (97) is integrated with other low-level
optimization phases, wherein said other low-level optimization phases comprise any of loop unrolling, register
reassociation, and instruction scheduling.

7. The compiler of any of Claims 1 to 6, further comprising:

means (96) for limiting insertion of explicit prefetch instructions to situations where a lower bound on an 
achievable loop iteration latency is unlikely to be increased as a result of said prefetch instruction insertion. 

8. A method for mitigating or eliminating cache misses, comprising the steps of: 

performing loop body analysis; 

unrolling loops to reduce prefetch instruction overhead; 
identifying uniformly generated equivalence classes of memory references in a code stream, where said
equivalence classes represent disjoint sets of memory references occurring in a loop whose address expressions
can be expressed as a linear function of the same basic loop induction variable and are known to differ only by a
compile time constant, allowing the detection of group spatial and group temporal locality among said different
memory references;

computing an effective memory stride for each of the equivalence classes; 

determining the number of prefetch instructions needed for full cache miss coverage for each equivalence
class, where the number of prefetch instructions that needs to be inserted is a function of the style of prefetching
desired, including dumb prefetching that inserts an explicit prefetch instruction for each memory reference, baseline
prefetching that inserts as many prefetch instructions as possible without affecting the resource minimum loop
iteration latency, and selective prefetching that inserts as many prefetch instructions as are required to ensure full
cache miss coverage, exploiting any group-spatial or group-temporal locality that may be apparent among memory
references within a uniformly generated equivalence class; and

inserting prefetch instructions identified into said code stream.


9. The method of Claim 8, further comprising the step of: 

estimating a prefetch iteration distance for a loop as the ratio of average miss latency and average loop iter- 
ation latency, where the average loop iteration latency is derived from a resource-constrained lower bound on a 
cycle count based on machine resource usage. 


10. The method of either of Claims 8 and 9, further comprising the step of: 

substituting a fixed constant value for unknown terms into the address expressions for memory references 
to run-time dimensioned arrays to facilitate partitioning of such references into disjoint equivalence classes. 

11. The method of any of Claims 8 to 10, further comprising the step of:

determining the number of prefetch instructions that are needed for each uniformly generated equivalence 
class for a selective prefetching strategy. 

12. The method of any of Claims 8 to 11, further comprising the step of:

sorting the address expressions for memory references belonging to an equivalence class based on their
relative constant differences.

13. The method of any of Claims 8 to 12, further comprising the step of: 

determining an effective memory stride for the memory references associated with each equivalence class
and classifying the effective memory stride as being either large or small based on whether it is greater than the
cache line size.

14. The method of any of Claims 8 to 13, further comprising the step of: 

determining prefetch memory distance for the memory references associated with each equivalence class
as the product of effective memory stride and prefetch iteration distance for the loop.

15. The method of any of Claims 8 to 14, further comprising the step of: 

removing memory references within a small-stride equivalence class that trail other memory references
within said equivalence class by less than the prefetch memory distance, wherein memory references that remain
are cluster leaders;

grouping the memory references belonging to a large-stride equivalence class that are sorted by their con- 
stant address expression differences into clusters each of which has a distinct memory reference designated as the 
cluster leader and zero or more memory references designated as cluster trailers; and 

merging clusters represented by their leaders to profitably exploit group temporal locality in a pairwise
fashion.

16. The method of any of Claims 8 to 15, further comprising:

deciding which equivalence classes to insert prefetch instructions for under the baseline prefetching
strategy by first sorting uniformly generated equivalence classes based on a prefetch cost/expected benefit criterion,
and only committing to insert prefetch instructions for those equivalence classes with the best cost/expected benefit
ratio, without causing resource-based minimum loop iteration latency to be exceeded.
17. The method of any of Claims 8 to 16, further comprising the steps of: running through clusters in each equivalence
class; generating explicit prefetch instructions for each cluster; and inserting said prefetch instructions into the code
stream.